{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Arbitrary value imputation\n",
    "\n",
    "Replacing the NA by artitrary values should be used when there are reasons to believe that the NA are not missing at random. In situations like this, we would not like to replace with the median or the mean, and therefore make the NA look like the majority of our observations.\n",
    "\n",
    "Instead, we want to flag them. We want to capture the missingness somehow.\n",
    "\n",
    "In previous lectures we saw 2 methods to do this:\n",
    "\n",
    "1) adding an additional binary variable to indicate whether the value is missing (1) or not (0)\n",
    "\n",
    "2) replacing the NA by a value at a far end of the distribution\n",
    "\n",
    "Here, I suggest an alternative to option 2, which I have seen in several Kaggle competitions. It consists of replacing the NA by an arbitrary value. Any of your creation, but ideally different from the median/mean/mode, and not within the normal values of the variable.\n",
    "\n",
    "The problem consists in deciding which arbitrary value to choose.\n",
    "\n",
    "### Advantages\n",
    "\n",
    "- Easy to implement\n",
    "- Captures the importance of missingess if there is one\n",
    "\n",
    "### Disadvantages\n",
    "\n",
    "- Distorts the original distribution of the variable\n",
    "- If missingess is not important, it may mask the predictive power of the original variable by distorting its distribution\n",
    "- Hard to decide which value to use\n",
    " If the value is outside the distribution it may mask or create outliers\n",
    "\n",
    "### Final note\n",
    "\n",
    "When variables are captured by third parties, like credit agencies, they place arbitrary numbers already to signal this missingness. So if not common practice in data competitions, it is common practice in real life data collections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "===============================================================================\n",
    "\n",
    "## Real Life example: \n",
    "\n",
    "### Predicting Survival on the Titanic: understanding society behaviour and beliefs\n",
    "\n",
    "Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on few attributes like gender, age, and social status, we can make very accurate predictions on which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper-class. Therefore, we can learn about the society priorities and privileges at the time.\n",
    "\n",
    "=============================================================================\n",
    "\n",
    "In the following cells, I will show how this procedure impacts features and machine learning using the Titanic and House Price datasets from Kaggle.\n",
    "\n",
    "If you haven't downloaded the datasets yet, in the lecture \"Guide to setting up your computer\" in section 1, you can find the details on how to do so."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# for classification\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "# to split and standarize the datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# to evaluate classification models\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>22.0</td>\n",
       "      <td>7.2500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>38.0</td>\n",
       "      <td>71.2833</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>26.0</td>\n",
       "      <td>7.9250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>35.0</td>\n",
       "      <td>53.1000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>8.0500</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Survived   Age     Fare\n",
       "0         0  22.0   7.2500\n",
       "1         1  38.0  71.2833\n",
       "2         1  26.0   7.9250\n",
       "3         1  35.0  53.1000\n",
       "4         0  35.0   8.0500"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load the Titanic Dataset with a few variables for demonstration\n",
    "\n",
    "data = pd.read_csv('titanic.csv', usecols = ['Age', 'Fare','Survived'])\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Survived    0.000000\n",
       "Age         0.198653\n",
       "Fare        0.000000\n",
       "dtype: float64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's look at the percentage of NA\n",
    "data.isnull().mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((623, 3), (268, 3))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's separate into training and testing set\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(data, data.Survived, test_size=0.3,\n",
    "                                                    random_state=0)\n",
    "X_train.shape, X_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def impute_na(df, variable):\n",
    "    df[variable+'_zero'] = df[variable].fillna(0)\n",
    "    df[variable+'_hundred']= df[variable].fillna(100)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Age_zero</th>\n",
       "      <th>Age_hundred</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>857</th>\n",
       "      <td>1</td>\n",
       "      <td>51.0</td>\n",
       "      <td>26.5500</td>\n",
       "      <td>51.0</td>\n",
       "      <td>51.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52</th>\n",
       "      <td>1</td>\n",
       "      <td>49.0</td>\n",
       "      <td>76.7292</td>\n",
       "      <td>49.0</td>\n",
       "      <td>49.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>386</th>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>46.9000</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124</th>\n",
       "      <td>0</td>\n",
       "      <td>54.0</td>\n",
       "      <td>77.2875</td>\n",
       "      <td>54.0</td>\n",
       "      <td>54.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>578</th>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.4583</td>\n",
       "      <td>0.0</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>549</th>\n",
       "      <td>1</td>\n",
       "      <td>8.0</td>\n",
       "      <td>36.7500</td>\n",
       "      <td>8.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>118</th>\n",
       "      <td>0</td>\n",
       "      <td>24.0</td>\n",
       "      <td>247.5208</td>\n",
       "      <td>24.0</td>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>20.0</td>\n",
       "      <td>20.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>157</th>\n",
       "      <td>0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>30.0</td>\n",
       "      <td>30.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>127</th>\n",
       "      <td>1</td>\n",
       "      <td>24.0</td>\n",
       "      <td>7.1417</td>\n",
       "      <td>24.0</td>\n",
       "      <td>24.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>653</th>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7.8292</td>\n",
       "      <td>0.0</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>235</th>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7.5500</td>\n",
       "      <td>0.0</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>785</th>\n",
       "      <td>0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>25.0</td>\n",
       "      <td>25.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>241</th>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>15.5000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351</th>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.0000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>862</th>\n",
       "      <td>1</td>\n",
       "      <td>48.0</td>\n",
       "      <td>25.9292</td>\n",
       "      <td>48.0</td>\n",
       "      <td>48.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>851</th>\n",
       "      <td>0</td>\n",
       "      <td>74.0</td>\n",
       "      <td>7.7750</td>\n",
       "      <td>74.0</td>\n",
       "      <td>74.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>753</th>\n",
       "      <td>0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>7.8958</td>\n",
       "      <td>23.0</td>\n",
       "      <td>23.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>532</th>\n",
       "      <td>0</td>\n",
       "      <td>17.0</td>\n",
       "      <td>7.2292</td>\n",
       "      <td>17.0</td>\n",
       "      <td>17.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>485</th>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>25.4667</td>\n",
       "      <td>0.0</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Survived   Age      Fare  Age_zero  Age_hundred\n",
       "857         1  51.0   26.5500      51.0         51.0\n",
       "52          1  49.0   76.7292      49.0         49.0\n",
       "386         0   1.0   46.9000       1.0          1.0\n",
       "124         0  54.0   77.2875      54.0         54.0\n",
       "578         0   NaN   14.4583       0.0        100.0\n",
       "549         1   8.0   36.7500       8.0          8.0\n",
       "118         0  24.0  247.5208      24.0         24.0\n",
       "12          0  20.0    8.0500      20.0         20.0\n",
       "157         0  30.0    8.0500      30.0         30.0\n",
       "127         1  24.0    7.1417      24.0         24.0\n",
       "653         1   NaN    7.8292       0.0        100.0\n",
       "235         0   NaN    7.5500       0.0        100.0\n",
       "785         0  25.0    7.2500      25.0         25.0\n",
       "241         1   NaN   15.5000       0.0        100.0\n",
       "351         0   NaN   35.0000       0.0        100.0\n",
       "862         1  48.0   25.9292      48.0         48.0\n",
       "851         0  74.0    7.7750      74.0         74.0\n",
       "753         0  23.0    7.8958      23.0         23.0\n",
       "532         0  17.0    7.2292      17.0         17.0\n",
       "485         0   NaN   25.4667       0.0        100.0"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's replace the NA with the median value in the training set\n",
    "impute_na(X_train, 'Age')\n",
    "impute_na(X_test, 'Age')\n",
    "\n",
    "X_train.head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logistic Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set\n",
      "Logistic Regression roc-auc: 0.6863462831608859\n",
      "Test set\n",
      "Logistic Regression roc-auc: 0.7137499999999999\n",
      "Train set\n",
      "Logistic Regression roc-auc: 0.6803594282119694\n",
      "Test set\n",
      "Logistic Regression roc-auc: 0.7227976190476191\n"
     ]
    }
   ],
   "source": [
    "# we compare the models built using Age filled with zero, vs Age filled with 100\n",
    "\n",
    "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n",
    "logit.fit(X_train[['Age_zero','Fare']], y_train)\n",
    "print('Train set')\n",
    "pred = logit.predict_proba(X_train[['Age_zero','Fare']])\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = logit.predict_proba(X_test[['Age_zero','Fare']])\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n",
    "\n",
    "logit = LogisticRegression(random_state=44, C=1000) # c big to avoid regularization\n",
    "logit.fit(X_train[['Age_hundred','Fare']], y_train)\n",
    "print('Train set')\n",
    "pred = logit.predict_proba(X_train[['Age_hundred','Fare']])\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set')\n",
    "pred = logit.predict_proba(X_test[['Age_hundred','Fare']])\n",
    "print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train set zero imputation\n",
      "Random Forests roc-auc: 0.7555855621353116\n",
      "Test set zero imputation\n",
      "Random Forests zero imputation roc-auc: 0.7490476190476191\n",
      "\n",
      "Train set median imputation\n",
      "Random Forests roc-auc: 0.7490781111038807\n",
      "Test set median imputation\n",
      "Random Forests roc-auc: 0.7653571428571431\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# random forests\n",
    "\n",
    "rf = RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3)\n",
    "rf.fit(X_train[['Age_zero', 'Fare']], y_train)\n",
    "print('Train set zero imputation')\n",
    "pred = rf.predict_proba(X_train[['Age_zero', 'Fare']])\n",
    "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set zero imputation')\n",
    "pred = rf.predict_proba(X_test[['Age_zero', 'Fare']])\n",
    "print('Random Forests zero imputation roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n",
    "print()\n",
    "rf = RandomForestClassifier(n_estimators=100, random_state=39, max_depth=3)\n",
    "rf.fit(X_train[['Age_hundred', 'Fare']], y_train)\n",
    "print('Train set median imputation')\n",
    "pred = rf.predict_proba(X_train[['Age_hundred', 'Fare']])\n",
    "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))\n",
    "print('Test set median imputation')\n",
    "pred = rf.predict_proba(X_test[['Age_hundred', 'Fare']])\n",
    "print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that replacing NA with 100 makes the models perform better than replacing NA with 0. This is, if you remember from the lecture \"Replacing NA by mean or median\" because children were more likely to survive than adults. Then filling NA with zeroes, distorts this relation and makes the models loose predictive power. See below for a re-cap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average real survival of children:  0.5740740740740741\n",
      "Average survival of children when using Age imputed with zeroes:  0.38857142857142857\n",
      "Average survival of children when using Age imputed with median:  0.5740740740740741\n"
     ]
    }
   ],
   "source": [
    "print('Average real survival of children: ', X_train[X_train.Age<15].Survived.mean())\n",
    "print('Average survival of children when using Age imputed with zeroes: ', X_train[X_train.Age_zero<15].Survived.mean())\n",
    "print('Average survival of children when using Age imputed with median: ', X_train[X_train.Age_hundred<15].Survived.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Final notes\n",
    "\n",
    "The arbitrary value has to be determined for each variable specifically. For example, for this dataset, the choice of replacing NA in age by 0 or 100 are valid, because none of those values are frequent in the original distribution of the variable, and they lie at the tails of the distribution.\n",
    "\n",
    "However, if we were to replace NA in fare, those values are not good any more, because we can see that fare can take values of up to 500. So we might want to consider using 500 or 1000 to replace NA instead of 100.\n",
    "\n",
    "As you can see this is totally arbitrary. And yet, it is used in the industry.\n",
    "\n",
    "Typical values chose by companies are -9999 or 9999, or similar."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
